An Introduction to Machine Learning for Quants

ML Forum 2017

Mick Cooney mickcooney@gmail.com

2017-10-05

Introduction

Who Am I?


Former quant


Statistical volatility arbitrage


Options on equities and equity indexes

Layout


What is ML?


Survey of ML


Miscellanea

What is Machine Learning?

Different meanings

Terminology


Machine Learning


Artificial Intelligence


Statistical Learning


Applied Statistics

Historical context important


ML primarily from CS / EE

‘Engineering’ mentality

Data Format


Tabular based


##    default student  balance   income
## 1       No      No  729.526 44361.63
## 2       No     Yes  817.180 12106.13
## 3       No      No 1073.549 31767.14
## 4       No      No  529.251 35704.49
## 5       No      No  785.656 38463.50
## 6       No     Yes  919.589  7491.56
## 7       No      No  825.513 24905.23
## 8       No     Yes  808.668 17600.45
## 9       No      No 1161.058 37468.53
## 10      No      No    0.000 29275.27
## 11      No     Yes    0.000 21871.07
## 12      No     Yes 1220.584 13268.56
## 13      No      No  237.045 28251.70
## 14      No      No  606.742 44994.56
## 15      No      No 1112.968 23810.17

Exchangeability

Predictive Focus


Predictive accuracy


De-emphasises inference / uncertainty / explainability

Discoverability of model parameters

Example of linear models

Production


Scaling issues


Automated ML pipelines


Software engineering

Model Validation

Overfitting

Bias-Variance Tradeoff

\[\begin{eqnarray*} \text{Bias} &=& \text{under-complexity error} \\ \text{Variance} &=& \text{over-complexity error} \end{eqnarray*}\]
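
One common way to make the tradeoff precise is the decomposition of expected squared prediction error. The notation is assumed here rather than taken from the slides: data are generated as \(y = f(x) + \epsilon\) with noise variance \(\sigma^2\), and \(\hat{f}\) is the fitted model, with expectations taken over training sets and noise.

\[ \mathbb{E}\left[ (y - \hat{f}(x))^2 \right] = \underbrace{\left( \mathbb{E}[\hat{f}(x)] - f(x) \right)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}\left[ \left( \hat{f}(x) - \mathbb{E}[\hat{f}(x)] \right)^2 \right]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Irreducible noise}} \]

Simpler models tend to increase the first term, more flexible models the second; the third cannot be reduced by any model.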

Cross-validation


Training-test split


\(k\)-fold


Train-validation-test split
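
A minimal sketch of how \(k\)-fold splitting might be done by hand, using only the Python standard library. The function name `k_fold_indices` and the choice of 15 rows and 5 folds are illustrative assumptions, not part of the talk.

```python
import random

def k_fold_indices(n_rows, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)        # shuffle once, reproducibly
    folds = [idx[i::k] for i in range(k)]   # k roughly equal-sized folds
    for i in range(k):
        test_idx = folds[i]
        # Training rows are all rows in the other k - 1 folds.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

# Hypothetical usage: 15 rows (as in the table shown earlier), 5 folds.
splits = list(k_fold_indices(n_rows=15, k=5))
```

Each row lands in exactly one test fold, so every observation is used for both fitting and evaluation. Shuffling like this presumes the rows are exchangeable; time-ordered data usually calls for a split that respects the ordering.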

Supervised Learning

Labelled data


\[ \begin{eqnarray*} \text{Discrete output} &\rightarrow& \text{Classification} \\ \text{Continuous output} &\rightarrow& \text{Regression} \end{eqnarray*} \]

Linear Models


\[ y = \beta_0 + \beta_1 \phi_1(X_1) + ... + \beta_n \phi_n(X_n) + \epsilon \]


Linear in parameters \(\beta\)
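
To illustrate "linear in parameters", here is a small sketch that fits \(y = \beta_0 + \beta_1 \log(x)\) by ordinary least squares. The transform \(\phi(x) = \log(x)\), the toy data, and the helper `fit_simple_ols` are assumptions chosen for illustration: the model is non-linear in \(x\) but still linear in \(\beta\), so the usual closed-form solution applies.

```python
import math

def fit_simple_ols(z, y):
    """Closed-form ordinary least squares for y = b0 + b1 * z."""
    n = len(z)
    z_bar = sum(z) / n
    y_bar = sum(y) / n
    s_zy = sum((zi - z_bar) * (yi - y_bar) for zi, yi in zip(z, y))
    s_zz = sum((zi - z_bar) ** 2 for zi in z)
    b1 = s_zy / s_zz
    b0 = y_bar - b1 * z_bar
    return b0, b1

# Noise-free toy data generated from y = 2 + 3 * log(x).
x = [1.0, 2.0, 4.0, 8.0, 16.0]
y = [2.0 + 3.0 * math.log(xi) for xi in x]

# Transform the input first, then fit a straight line in the transformed variable.
phi_x = [math.log(xi) for xi in x]
b0, b1 = fit_simple_ols(phi_x, y)
```

Because the toy data carry no noise, the fit recovers the generating coefficients up to floating-point error.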

Tree Methods



Simple to understand


Highly explainable


Prone to overfitting
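
A sketch of the single step a classification tree repeats: choosing the threshold on one variable that best separates the classes. Gini impurity is assumed as the split criterion, and the `balance` and `default` values below are made up for illustration rather than taken from the table.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def best_split(x, y):
    """Return (threshold, weighted Gini) for the best rule of the form x <= threshold."""
    n = len(x)
    values = sorted(set(x))
    best = (None, float("inf"))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2.0
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (t, score)
    return best

# Hypothetical data: 1 marks a default, 0 no default.
balance = [0.0, 237.0, 529.0, 730.0, 1113.0, 1650.0, 1900.0, 2200.0]
default = [0, 0, 0, 0, 0, 1, 1, 1]
threshold, impurity = best_split(balance, default)
```

A full tree applies this search recursively to each resulting subset. Repeating it until every leaf is pure is one way the overfitting noted above arises.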

Random Forest


Ensemble of trees


Aggregate low-bias trees to reduce variance

Bootstrap sample of rows per tree, random subset of variables considered at each split


Self-tuning (mostly)
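
The two sources of randomness in a random forest can be sketched without fitting any trees. The helper names, the seed, and the choice of 2 candidate variables out of 3 are illustrative assumptions.

```python
import random

N_ROWS = 15
FEATURES = ["student", "balance", "income"]

def bootstrap_rows(n_rows, rng):
    """Sample n_rows row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]

def candidate_features(features, m, rng):
    """Pick m distinct features at random as the only candidates for a split."""
    return rng.sample(features, m)

rng = random.Random(42)

# One draw per tree here; an actual forest redraws the features at every split.
draws = [
    (bootstrap_rows(N_ROWS, rng), candidate_features(FEATURES, 2, rng))
    for _ in range(100)
]

# Share of rows that a tree never sees ("out-of-bag"), averaged over trees.
oob_fraction = sum(1 - len(set(rows)) / N_ROWS for rows, _ in draws) / len(draws)
```

Roughly a third of the rows are left out of each bootstrap sample. Those out-of-bag rows give each tree a built-in test set, which is part of why the method needs little manual tuning.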

Boosting


Ensemble of trees


Aggregate low-variance trees to reduce bias

Often the most accurate approach on tabular data


Tuning more involved
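
A minimal sketch of boosting for regression under squared-error loss, where each round fits a one-split tree (a "stump") to the current residuals. The toy data, the 50 rounds, and the learning rate of 0.3 are assumptions for illustration. This is a generic gradient-boosting outline, not any particular library's algorithm.

```python
def fit_stump(x, r):
    """Regression stump: best single threshold on x, predicting the mean of r on each side."""
    best = None
    values = sorted(set(x))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2.0
        left = [ri for xi, ri in zip(x, r) if xi <= t]
        right = [ri for xi, ri in zip(x, r) if xi > t]
        m_left, m_right = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((ri - m_left) ** 2 for ri in left)
               + sum((ri - m_right) ** 2 for ri in right))
        if best is None or sse < best[0]:
            best = (sse, t, m_left, m_right)
    _, t, m_left, m_right = best
    return lambda xi: m_left if xi <= t else m_right

def boost(x, y, n_rounds=50, learning_rate=0.3):
    """Fit stumps to residuals in sequence; return training MSE after each round."""
    pred = [sum(y) / len(y)] * len(y)   # start from the mean of y
    history = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)
        # Add a shrunken copy of the new stump to the running prediction.
        pred = [pi + learning_rate * stump(xi) for xi, pi in zip(x, pred)]
        history.append(sum((yi - pi) ** 2 for yi, pi in zip(y, pred)) / len(y))
    return history

# Toy data: a smooth curve that no single stump can fit well.
x = [i / 10.0 for i in range(20)]
y = [xi ** 2 for xi in x]
mse_history = boost(x, y)
```

Each stump is a high-bias, low-variance learner. Summing many of them steadily reduces the bias. The number of rounds and the learning rate are the kind of settings that make tuning more involved.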

Kernel Methods


Uses kernel functions


Avoids explicit co-ordinate transforms (the 'kernel trick')
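
A small check of the idea, assuming a degree-2 polynomial kernel on 2-d inputs (a choice made here for illustration). The kernel value computed directly in the original space equals the inner product after an explicit transform into a 3-d feature space, so the transform itself never has to be carried out.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))

def explicit_map(v):
    """Explicit degree-2 feature map for a 2-d input: (v1^2, sqrt(2) v1 v2, v2^2)."""
    v1, v2 = v
    return (v1 * v1, math.sqrt(2.0) * v1 * v2, v2 * v2)

def poly_kernel(a, b):
    """Homogeneous polynomial kernel of degree 2: (a . b)^2."""
    return dot(a, b) ** 2

u, w = (1.0, 2.0), (3.0, -1.0)
via_transform = dot(explicit_map(u), explicit_map(w))  # work done in 3-d feature space
via_kernel = poly_kernel(u, w)                         # work done in the original 2-d space
```

The two quantities agree up to floating-point error. For higher degrees, or for kernels whose feature space is infinite-dimensional, only the kernel route remains practical.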

Support Vector Machines (SVM)


Geometric method


Divides ‘feature space’ into regions

Gaussian Processes

Neural Networks

Unsupervised Learning

Unlabelled data

Clustering

k-Means
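
A minimal sketch of the usual k-means iteration (often called Lloyd's algorithm) on 2-d points. The toy points and the deliberately poor starting centroids are assumptions for illustration. Practical implementations typically use random restarts or a smarter initialisation, since the result can depend on the starting point.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two 2-d points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def k_means(points, init_centroids, n_iter=10):
    """Alternate assignment and update steps; return (centroids, assignments)."""
    centroids = list(init_centroids)
    k = len(centroids)
    assignments = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centroids, assignments

# Two well-separated groups of four points each (hypothetical data).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0)]

# Both starting centroids sit inside the first group on purpose.
centroids, assignments = k_means(points, init_centroids=[(0.0, 0.0), (1.0, 1.0)])
```

Even from this poor start, the second centroid is pulled across to the far group within a couple of iterations. No labels are used at any point.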

Real-world Example

Dimensionality Reduction


Many variables (sometimes thousands)


Correlated / dependent / useless

PCA / SVD

Reduce dimensionality while retaining most of the variance
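
A sketch of PCA for the simplest possible case, two correlated variables, where the eigenvalues of the 2x2 covariance matrix have a closed form. The data values and the helper `pca_2d_variances` are illustrative assumptions. Real problems with many variables would rely on a numerical SVD or eigen-decomposition routine instead.

```python
import math

def pca_2d_variances(xs, ys):
    """Return the variances along the two principal components (largest first)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    # Eigenvalues of the covariance matrix [[sxx, sxy], [sxy, syy]].
    half_trace = (sxx + syy) / 2.0
    det = sxx * syy - sxy ** 2
    disc = math.sqrt(max(half_trace ** 2 - det, 0.0))
    return half_trace + disc, half_trace - disc

# Two strongly correlated variables (hypothetical values).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

lam1, lam2 = pca_2d_variances(xs, ys)
explained = lam1 / (lam1 + lam2)   # share of total variance kept by one component
```

Here a single component captures nearly all of the variance, so the second dimension can be dropped at little cost. With weakly correlated variables the share would be much lower.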

Natural Language Processing

Latent Dirichlet Allocation (LDA)


Unsupervised (clustering)


Topic modelling


Lots of functionality

word2vec


Words as vectors


Semantic meaning


\[ \text{King} - \text{Male} + \text{Female} \approx \text{Queen} \]
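
The analogy arithmetic can be sketched with hand-made vectors. The 2-d embeddings below are invented for illustration (first coordinate loosely "royalty", second loosely "gender"). Real word2vec vectors are learned from text and have hundreds of dimensions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    num = sum(ai * bi for ai, bi in zip(a, b))
    den = math.sqrt(sum(ai * ai for ai in a)) * math.sqrt(sum(bi * bi for bi in b))
    return num / den

# Hypothetical toy embeddings, not learned from any corpus.
vectors = {
    "king":   (0.9, 0.9),
    "queen":  (0.9, -0.9),
    "male":   (0.1, 0.9),
    "female": (0.1, -0.9),
    "apple":  (-0.8, 0.0),
}

# King - Male + Female, computed coordinate by coordinate.
target = tuple(
    k - m + f
    for k, m, f in zip(vectors["king"], vectors["male"], vectors["female"])
)

# Rank the remaining words by cosine similarity to the target vector.
candidates = [w for w in vectors if w not in ("king", "male", "female")]
nearest = max(candidates, key=lambda w: cosine(vectors[w], target))
```

Excluding the three query words from the candidates is a common convention in analogy lookups; without it the nearest word is often one of the inputs.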

Summary

Thank You!!!


mickcooney@gmail.com


https://github.com/kaybenleroll/dublin_r_workshops